Reviews: Online Stochastic Shortest Path with Bandit Feedback and Unknown Transition Function
The submission studies adversarial online learning in episodic loop-free Markov decision processes. Its importance is that it is the first work to analyze an adversarial online learning problem in which the transition function is unknown, the loss functions change over time, and only bandit feedback is available. The related work section clearly describes how this line of research progressed from settings with a fixed unknown transition function and an unknown loss function to the setting studied in this submission. Although the MDPs considered in the submission are L-layered and loop-free, the results and the analysis pave the way for general MDPs. The main idea is to design confidence sets that include the optimal occupancy measure, which induces the optimal policy.
Understanding overfitting in random forest for probability estimation: a visualization and simulation study
Barreñada, Lasai, Dhiman, Paula, Timmerman, Dirk, Boulesteix, Anne-Laure, Van Calster, Ben
Random forests have become popular for clinical risk prediction modelling. In a case study on predicting ovarian malignancy, we observed training c-statistics close to 1. Although this suggests overfitting, performance was competitive on test data. We aimed to understand the behaviour of random forests by (1) visualizing data space in three real-world case studies and (2) a simulation study. For the case studies, risk estimates were visualised using heatmaps in a 2-dimensional subspace. The simulation study included 48 logistic data generating mechanisms (DGM), varying the predictor distribution, the number of predictors, the correlation between predictors, the true c-statistic and the strength of true predictors. For each DGM, 1000 training datasets of size 200 or 4000 were simulated and RF models were trained with minimum node size 2 or 20 using the ranger package, resulting in 192 scenarios in total. The visualizations suggested that the model learned spikes of probability around events in the training set. A cluster of events created a bigger peak, while isolated events created local peaks. In the simulation study, median training c-statistics were between 0.97 and 1 unless there were 4 or 16 binary predictors with minimum node size 20. Median test c-statistics were higher with higher events per variable, higher minimum node size, and binary predictors. Median training slopes were always above 1, and were not correlated with median test slopes across scenarios (correlation -0.11). Median test slopes were higher with higher true c-statistic, higher minimum node size, and higher sample size. Random forests learn local probability peaks that often yield near perfect training c-statistics without strongly affecting c-statistics on test data. When the aim is probability estimation, the simulation results go against the common recommendation to use fully grown trees in random forest models.
On Optimizing Hyperparameters for Quantum Neural Networks
Herbst, Sabrina, De Maio, Vincenzo, Brandic, Ivona
The increasing capabilities of Machine Learning (ML) models go hand in hand with an immense amount of data and computational power required for training. Therefore, training is usually outsourced to HPC facilities, where we have started to experience limits in scaling conventional HPC hardware, as theorized by Moore's law. Despite heavy parallelization and optimization efforts, current state-of-the-art ML models require weeks for training, which is associated with an enormous $CO_2$ footprint. Quantum Computing, and specifically Quantum Machine Learning (QML), can offer significant theoretical speed-ups and enhanced expressive power. However, training QML models requires tuning various hyperparameters, which is a nontrivial task, and suboptimal choices can strongly affect the trainability and performance of the models. In this study, we identify the most impactful hyperparameters and collect data about the performance of QML models. We compare different configurations and provide researchers with performance data and concrete suggestions for hyperparameter selection.
Class-wise and reduced calibration methods
Panchenko, Michael, Benmerzoug, Anes, Delgado, Miguel de Benito
For many applications of probabilistic classifiers it is important that the predicted confidence vectors reflect true probabilities (one says that the classifier is calibrated). It has been shown that common models fail to satisfy this property, making reliable methods for measuring and improving calibration important tools. Unfortunately, obtaining these is far from trivial for problems with many classes. We propose two techniques that can be used in tandem. First, a reduced calibration method transforms the original problem into a simpler one. We prove for several notions of calibration that solving the reduced problem minimizes the corresponding notion of miscalibration in the full problem, allowing the use of non-parametric recalibration methods that fail in higher dimensions. Second, we propose class-wise calibration methods, based on intuition building on a phenomenon called neural collapse and the observation that most of the accurate classifiers found in practice can be thought of as a union of K different functions which can be recalibrated separately, one for each class. These typically out-perform their non class-wise counterparts, especially for classifiers trained on imbalanced data sets. Applying the two methods together results in class-wise reduced calibration algorithms, which are powerful tools for reducing the prediction and per-class calibration errors. We demonstrate our methods on real and synthetic datasets and release all code as open source at https://github.com/appliedAI-Initiative
Cost Function
When dealing with linear regression, we can draw many lines for different values of the slope and intercept. The main question is which of those lines best represents the relationship between X and Y, and to answer it we can use the Mean Squared Error (MSE) as the criterion. For linear regression, this MSE is the cost function. Mean Squared Error is the average of the squared differences between the predictions and the true values, and the output is a single number representing the cost. So the line with the minimum cost, or MSE, represents the relationship between X and Y in the best possible manner.
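As a minimal sketch of the idea, the snippet below computes the MSE for a few candidate lines and picks the one with the lowest cost. The data points, the candidate slopes and intercepts, and the helper names `mean_squared_error` and `line_cost` are made up for illustration.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between true values and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def line_cost(slope, intercept, xs, ys):
    """Cost (MSE) of the line y = slope * x + intercept on the data."""
    predictions = [slope * x + intercept for x in xs]
    return mean_squared_error(ys, predictions)


# Hypothetical data lying exactly on the line y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

# Hypothetical candidate lines as (slope, intercept) pairs.
candidates = [(1.0, 0.0), (2.0, 0.0), (3.0, -1.0)]

# One cost per candidate line; the best line has the smallest cost.
costs = {c: line_cost(c[0], c[1], xs, ys) for c in candidates}
best = min(costs, key=costs.get)
print(best, costs[best])
```

In practice the candidates are not enumerated by hand; an optimizer such as gradient descent searches over slope and intercept for the minimum cost.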
Predictive Information Accelerates Learning in RL
Lee, Kuang-Huei, Fischer, Ian, Liu, Anthony, Guo, Yijie, Lee, Honglak, Canny, John, Guadarrama, Sergio
The Predictive Information is the mutual information between the past and the future, I(X_past; X_future). We hypothesize that capturing the predictive information is useful in RL, since the ability to model what will happen next is necessary for success on many tasks. To test our hypothesis, we train Soft Actor-Critic (SAC) agents from pixels with an auxiliary task that learns a compressed representation of the predictive information of the RL environment dynamics using a contrastive version of the Conditional Entropy Bottleneck (CEB) objective. We refer to these as Predictive Information SAC (PI-SAC) agents. We show that PI-SAC agents can substantially improve sample efficiency over challenging baselines on tasks from the DM Control suite of continuous control environments. We evaluate PI-SAC agents by comparing against uncompressed PI-SAC agents, other compressed and uncompressed agents, and SAC agents directly trained from pixels.
A Simulation Model Demonstrating the Impact of Social Aspects on Social Internet of Things
In addition to seamless connectivity and smartness, the objects in the Internet of Things (IoT) are expected to have social capabilities -- these objects are termed ``social objects''. In this paper, an intuitive paradigm of social interactions between these objects is argued and modeled. The impact of social behavior on the interaction pattern of social objects is studied taking Peer-to-Peer (P2P) resource sharing as an example application. The model proposed in this paper studies the implications of competitive vs. cooperative social paradigms, while peers attempt to attain the shared resources / services. The simulation results show that the social capabilities of the peers impart a significant increase in the quality of interactions between social objects. Through an agent-based simulation study, it is shown that the cooperative strategy is more efficient than the competitive strategy. Moreover, cooperation with an underpinning on real-life networking structure and mobility does not negatively impact the efficiency of the system at all; rather it helps.
Linear Regression with Gradient Descent from Scratch in Numpy
I strongly advise you to read the article linked above. It will set the foundations on the topic, plus some math is already discussed there. To start out, I'll define my dataset -- only three points that are in a linear relationship. I've chosen so few points only because the math will be shorter -- needless to say, the math won't be more complex for a longer dataset, it would just be longer, and I don't want to make some stupid arithmetic mistake. Then I'll set coefficients beta 0 and beta 1 to some constant and define the cost function as the Sum of Squared Residuals (SSR/SSE).
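A rough sketch of that setup might look like the following. The three points, the starting constants, and the function name `ssr` are assumptions chosen for illustration, not the values from the original article, and plain Python lists are used so the arithmetic stays easy to check by hand.

```python
# Hypothetical dataset: three points on the line y = 1 + 2x.
xs = [1.0, 2.0, 3.0]
ys = [3.0, 5.0, 7.0]

# Hypothetical starting constants for the coefficients.
beta_0, beta_1 = 0.0, 0.0


def ssr(beta_0, beta_1, xs, ys):
    """Sum of Squared Residuals for the line y = beta_0 + beta_1 * x."""
    return sum((y - (beta_0 + beta_1 * x)) ** 2 for x, y in zip(xs, ys))


# With both coefficients at zero, the residuals are just the y values:
# 3^2 + 5^2 + 7^2 = 83.
initial_cost = ssr(beta_0, beta_1, xs, ys)

# The line that generated the data has zero residuals.
perfect_cost = ssr(1.0, 2.0, xs, ys)

print(initial_cost, perfect_cost)
```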
Gradient Descent Demystified in 5 Minutes
The algorithm starts off by setting initial values for the coefficients -- you are free to set the values to whatever you like (just not a string or boolean), but the common practice is to set them to 0. If I have two coefficients, let's say beta 0 and beta 1, I would set them both to zero initially. Now just to keep things simple let's say I'm dealing with a linear regression task, and those betas are my coefficients (beta 0 being the bias, i.e. the intercept). The cost function is quite simple to read: you make a prediction, subtract that prediction from the actual value, and take the square of the result. Now comes the part where you should know a bit of Calculus to fully understand what's going on. You need to calculate partial derivatives for each of the coefficients, so the coefficients can be updated later. Some time ago I wrote an article on taking derivatives in Python, and it covers those topics to a degree. As my model has only two coefficients, I need to calculate two partial derivatives, one with respect to beta 0, and the other with respect to beta 1. Now comes the part in which you take those two partial derivatives and run what is known as an epoch -- just a fancy word for a single iteration through the dataset.
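The steps described above can be sketched roughly as follows, assuming the cost is the sum of squared errors over the dataset. The data points, the learning rate of 0.01, the number of epochs, and the function names `gradients` and `run_epoch` are all hypothetical choices for illustration, not the article's own code.

```python
def gradients(beta_0, beta_1, xs, ys):
    """Partial derivatives of sum((y - (beta_0 + beta_1 * x))^2)."""
    residuals = [y - (beta_0 + beta_1 * x) for x, y in zip(xs, ys)]
    d_beta_0 = -2.0 * sum(residuals)                                # d cost / d beta_0
    d_beta_1 = -2.0 * sum(x * r for x, r in zip(xs, residuals))     # d cost / d beta_1
    return d_beta_0, d_beta_1


def run_epoch(beta_0, beta_1, xs, ys, learning_rate):
    """One pass over the dataset followed by one update of both coefficients."""
    d_beta_0, d_beta_1 = gradients(beta_0, beta_1, xs, ys)
    return beta_0 - learning_rate * d_beta_0, beta_1 - learning_rate * d_beta_1


# Hypothetical data on the line y = 1 + 2x.
xs = [1.0, 2.0, 3.0]
ys = [3.0, 5.0, 7.0]

# Common practice: start both coefficients at zero.
beta_0, beta_1 = 0.0, 0.0
first_gradients = gradients(beta_0, beta_1, xs, ys)

# Repeat the epoch many times; the step size is an assumed value.
for _ in range(5000):
    beta_0, beta_1 = run_epoch(beta_0, beta_1, xs, ys, learning_rate=0.01)

print(first_gradients, beta_0, beta_1)
```

On this toy data the coefficients approach beta 0 = 1 and beta 1 = 2. A learning rate that is too large makes the updates diverge, so the value usually needs some tuning.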
Non-ergodic Convergence Analysis of Heavy-Ball Algorithms
Sun, Tao, Yin, Penghang, Li, Dongsheng, Huang, Chun, Guan, Lei, Jiang, Hao
In this paper, we revisit the convergence of the Heavy-ball method, and present improved convergence complexity results in the convex setting. We provide the first non-ergodic O(1/k) rate result of the Heavy-ball algorithm with constant step size for coercive objective functions. For objective functions satisfying a relaxed strongly convex condition, linear convergence is established under weaker assumptions on the step size and inertial parameter than those made in the existing literature. We extend our results to the multi-block version of the algorithm with both the cyclic and stochastic update rules. In addition, our results can also be extended to decentralized optimization, where the ergodic analysis is not applicable.
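For readers unfamiliar with the method, here is a minimal sketch of the standard heavy-ball iteration with a constant step size and inertial parameter, x_{k+1} = x_k - step * grad f(x_k) + inertia * (x_k - x_{k-1}). The quadratic objective, the parameter values, and the function names are assumptions for illustration and are not taken from the paper.

```python
def grad_f(point):
    """Gradient of the assumed objective f(x, y) = 0.5 * (x^2 + 10 * y^2)."""
    x, y = point
    return (x, 10.0 * y)


def heavy_ball(grad, start, step, inertia, iterations):
    """Heavy-ball iteration with constant step size and inertial parameter."""
    previous = start
    current = start
    for _ in range(iterations):
        g = grad(current)
        # Gradient step plus a momentum term built from the last displacement.
        new_point = tuple(
            c - step * gi + inertia * (c - p)
            for c, gi, p in zip(current, g, previous)
        )
        previous, current = current, new_point
    return current


# Hypothetical starting point and parameters.
solution = heavy_ball(grad_f, start=(5.0, -3.0), step=0.05, inertia=0.5, iterations=500)
distance_to_minimizer = (solution[0] ** 2 + solution[1] ** 2) ** 0.5
print(solution, distance_to_minimizer)
```

Here the last iterate itself is returned rather than an average of the iterates, which is the quantity a non-ergodic rate describes.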